Mixed Models Exercises 💻 Practicum 1

Master Applied Data Science

Authors
Affiliation
Rebecca Stellato

Utrecht University

Duco Veen
Ernst Paul Swens

Description

During this computer lab you will:

  1. Reproduce the results from the lectures.
  2. Investigate how centering time affects the results.
  3. Explore the possibility of a quadratic effect of time.
  4. Adjust the analysis to account for baseline scores.
  5. Analyze another dataset related to stroke data.

Before we get started, load the following packages.

library(tidyverse)
library(nlme)

Exercise 1

In this assignment, you will replicate the analyses of the Reisby dataset featured in the slides, with a focus on modeling time as a continuous variable.

The Reisby dataset includes weekly depression scores that were collected over a span of \(6\) weeks from a sample of \(66\) patients experiencing various types of depression.

Variable Description Values
id Patient id 101-610
week Time variable 0-5
endo Types of depression 0=exogenous 1=endogenous depression
hdrs Depression scores 0-52

The dataset is available in two formats: wide format (reisby_wide.Rdata) and long format (reisby_long.Rdata). The wide format contains all observations in separate rows, while the long format includes observations from different time points on separate lines, with six lines per patient.

load("data/reisby_wide.Rdata")
head(reisby.wide, n = 3)
   id endo hdrs.0 hdrs.1 hdrs.2 hdrs.3 hdrs.4 hdrs.5
1 101    0     26     22     18      7      4      3
2 103    0     33     24     15     24     15     13
3 104    1     29     22     18     13     19      0
load("data/reisby_long.Rdata")
head(reisby.long, n = 6)
   id hdrs week endo
1 101   26    0    0
2 101   22    1    0
3 101   18    2    0
4 101    7    3    0
5 101    4    4    0
6 101    3    5    0

You will need to use the long format data for the mixed models, while some of the descriptive analyses may be easier to perform using the wide format data.


a) Import the data using the load() function.

load("data/reisby_long.Rdata")
load("data/reisby_wide.Rdata")

b) Perform a short data exploration on the Reisby data.

Provide descriptive statistics and create plots of the longitudinal data.

Obtaining descriptive statistics, such as means, standard deviations, and correlations, is generally easier using the wide format of the data. On the other hand, the long format is more suitable for generating spaghetti plots and individual plots.

We will begin by examining basic descriptive statistics for each variable.

summary(reisby.wide)
       id             endo            hdrs.0          hdrs.1     
 Min.   :101.0   Min.   :0.0000   Min.   :15.00   Min.   :11.00  
 1st Qu.:303.2   1st Qu.:0.0000   1st Qu.:21.00   1st Qu.:19.00  
 Median :332.0   Median :1.0000   Median :22.00   Median :22.00  
 Mean   :324.0   Mean   :0.5606   Mean   :23.44   Mean   :21.84  
 3rd Qu.:353.8   3rd Qu.:1.0000   3rd Qu.:27.00   3rd Qu.:25.00  
 Max.   :610.0   Max.   :1.0000   Max.   :34.00   Max.   :39.00  
                                  NA's   :5       NA's   :3      
     hdrs.2          hdrs.3          hdrs.4          hdrs.5     
 Min.   :10.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
 1st Qu.:14.00   1st Qu.:12.00   1st Qu.: 9.00   1st Qu.: 6.25  
 Median :18.00   Median :16.00   Median :12.00   Median :11.00  
 Mean   :18.31   Mean   :16.42   Mean   :13.62   Mean   :11.95  
 3rd Qu.:22.00   3rd Qu.:21.00   3rd Qu.:19.00   3rd Qu.:16.75  
 Max.   :33.00   Max.   :32.00   Max.   :34.00   Max.   :33.00  
 NA's   :1       NA's   :1       NA's   :3       NA's   :8      

Subsequently, we will examine the correlations among the hdrs scores.

reisby.wide %>%
  select(starts_with("hdrs")) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(digits = 3)
       hdrs.0 hdrs.1 hdrs.2 hdrs.3 hdrs.4 hdrs.5
hdrs.0  1.000  0.493  0.410  0.333  0.227  0.184
hdrs.1  0.493  1.000  0.494  0.412  0.308  0.218
hdrs.2  0.410  0.494  1.000  0.738  0.669  0.461
hdrs.3  0.333  0.412  0.738  1.000  0.817  0.568
hdrs.4  0.227  0.308  0.669  0.817  1.000  0.654
hdrs.5  0.184  0.218  0.461  0.568  0.654  1.000

Next, we will generate a spaghetti plot that displays the hdrs scores over time for each subject.

reisby.long %>% 
  ggplot(aes(x = week, y = hdrs, group = id, color = factor(id))) + 
  geom_line(linewidth = 0.5) +
  geom_point(alpha = 0.5) + 
  theme_bw() +
  theme(legend.position = "None") 

It is also possible to generate a spaghetti plot for each type of depression.

reisby.long %>% 
  mutate(endo = factor(
    endo, levels = c(0, 1), labels = c("exogenous", "endogenous"))) %>%
  ggplot(aes(x = week, y = hdrs, group = id, color = factor(id))) + 
  geom_line(linewidth = 0.5) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  theme(legend.position = "None") +
  facet_grid(cols = vars(endo))

Lastly, we will examine a plot that displays the hdrs scores over time for each individual.

reisby.long %>% 
  mutate(id = factor(id)) %>% 
  ggplot(aes(x = week, y = hdrs, group = id)) + 
  geom_line(linewidth = 0.5, color = "black") +
  geom_smooth(
    formula = "y ~ x",
    method = "lm",
    se = FALSE) + 
  geom_point() + 
  facet_wrap(vars(id)) +
  theme_bw()

Upon examining the individual plots, it may not be immediately apparent whether modeling each individual’s hdrs score as a linear function of time is reasonable or not. We will investigate this further in exercise 3. For this exercise, we will assume that there is a linear time trend for each individual.


c) Formulate two possible hypotheses regarding the effect of endogenous depression (endo)?

One solution could be:

\(H_1:\) There is a main effect of endo, e.g., people with endogenous depression generally have higher or lower hdrs scores than those with exogenous depression (\(\beta_{\text{endo}} \neq 0\)).
\(H_0:\) There is no main effect of endo (\(\beta_{\text{endo}} = 0\)).

Another solution could be:

\(H_1:\) There is an interaction between endo and time, e.g., people with endogenous depression have a different time trend for hdrs scores compared to people with exogenous depression (\(\beta_{\text{endo} \times \text{time}} \neq 0\)).
\(H_0:\) There is no interaction between endo and time (\(\beta_{\text{endo} \times \text{time}} = 0\)).


d) Fit a mixed model on the Reisby dataset using the lme() function.

Call this model fit_1. In this model, estimate the depression score (hdrs) by including fixed effects for time (week), type of depression (endo), and the interaction of time and type of depression. Additionally, add a random intercept for each patient. Set the parameter na.action = "na.omit" and method = "ML".

Use ?lme() to inspect the documentation how to define a mixed-effects model.

Make sure to include week as a continuous variable in the model and not as a factor.

fit_1 <- lme(
  fixed     = hdrs ~ week + endo + week:endo, 
  random    = ~ 1 | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "ML")

summary(fit_1)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long 
       AIC      BIC    logLik
  2294.137 2317.699 -1141.069

Random effects:
 Formula: ~1 | id
        (Intercept) Residual
StdDev:    3.909812 4.362878

Fixed effects:  hdrs ~ week + endo + week:endo 
                Value Std.Error  DF    t-value p-value
(Intercept) 22.441628 0.9441419 307  23.769337  0.0000
week        -2.351842 0.1990804 307 -11.813530  0.0000
endo         1.992873 1.2702442  64   1.568890  0.1216
week:endo   -0.044176 0.2720416 307  -0.162387  0.8711
 Correlation: 
          (Intr) week   endo  
week      -0.524              
endo      -0.743  0.390       
week:endo  0.384 -0.732 -0.531

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.18214138 -0.59376247 -0.03548597  0.53988343  3.48261468 

Number of Observations: 375
Number of Groups: 66 

Optionally, we can also generate a plot of the fitted values, where the points represent the original data points and the lines represent the predicted values.

reisby.long %>%
  mutate(
    prediction = predict(fit_1, newdata = reisby.long),
    endo = factor(endo)) %>%
  ggplot(aes(x = week, y = prediction, group = id, color = endo)) + 
  geom_point(aes(x = week, y = hdrs), alpha = 0.5) + 
  geom_line() + 
  theme_bw() +
  theme(legend.position = "top") +
  ylab("hdrs")

It is worth noting that each individual has a different intercept. However, it appears that the predicted lines for all 66 patients are completely parallel. This is because the interaction between week:endo is nearly zero, which means that the differences in slope between individuals with endogenous and exogenous depression are barely noticeable to the naked eye.


e) Fit another mixed model using the lme() function.

Call this model fit_2. This second model shares the same fixed-effects structure as the previous model. However, unlike the previous model, this model includes not only a random intercept but also a random slope for time per patient. Again, set the parameter na.action = "na.omit" and method = "ML".

fit_2 <- lme(
  fixed     = hdrs ~ week + endo + week:endo, 
  random    = ~ week | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "ML")

summary(fit_2)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long 
       AIC      BIC    logLik
  2230.929 2262.345 -1107.465

Random effects:
 Formula: ~week | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr  
(Intercept) 3.411897 (Intr)
week        1.441191 -0.285
Residual    3.495500       

Fixed effects:  hdrs ~ week + endo + week:endo 
                Value Std.Error  DF   t-value p-value
(Intercept) 22.476263 0.7986137 307 28.144100  0.0000
week        -2.365687 0.3134843 307 -7.546430  0.0000
endo         1.988021 1.0747917  64  1.849680  0.0690
week:endo   -0.027056 0.4217255 307 -0.064155  0.9489
 Correlation: 
          (Intr) week   endo  
week      -0.451              
endo      -0.743  0.335       
week:endo  0.335 -0.743 -0.457

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.74478736 -0.49680979  0.03394173  0.49640589  3.62084094 

Number of Observations: 375
Number of Groups: 66 

Once again, we can create a plot of the fitted values where the points represent the original data points and the lines represent the fitted values.

reisby.long %>%
  mutate(
    prediction = predict(fit_2, newdata = reisby.long),
    endo = factor(endo)) %>%
  ggplot(aes(x = week, y = prediction, group = id, color = endo)) + 
  geom_point(aes(x = week, y = hdrs), alpha = 0.5) + 
  geom_line() + 
  theme_bw() +
  theme(legend.position = "top") +
  ylab("hdrs")

From this plot, we can see that each patient in this model has a unique intercept and slope for time.


f) Determine which model fits the data best using the AIC and likelihood ratio test.

Use the anova() function to compare both models.

When conducting a likelihood ratio test to determine whether adding a random effect improves the model fit, it is important to note that the estimate of the random effect (a variance parameter) cannot be negative. As a result, this is considered a one-sided test, and you should divide the \(p\)-value by \(2\).

anova(fit_1, fit_2)
      Model df      AIC      BIC    logLik   Test  L.Ratio p-value
fit_1     1  6 2294.137 2317.699 -1141.069                        
fit_2     2  8 2230.929 2262.345 -1107.465 1 vs 2 67.20798  <.0001

To compare both models, we can use the Akaike information criterion (AIC), which weights model performance and complexity in a single metric. In model comparison using AIC, the model with the lower AIC value is generally preferred. The AIC values indicated that fit_2 (AIC = 2231) provided a better fit to the data compared to fit_1 (AIC = 2294).

Alternatively, we can use the likelihood-ratio test since these are nested models. This test evaluates the goodness of fit of two competing statistical models by comparing the ratio of their likelihoods. The null hypothesis of the test states that the simpler model provides as good a fit for the data as the more complex model. The likelihood-ratio test showed a significant difference between the two models, indicating that the more complex model fit_2 provided a better fit to the data.


g) Give an interpretation of the best fitting model.

summary(fit_2)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long 
       AIC      BIC    logLik
  2230.929 2262.345 -1107.465

Random effects:
 Formula: ~week | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr  
(Intercept) 3.411897 (Intr)
week        1.441191 -0.285
Residual    3.495500       

Fixed effects:  hdrs ~ week + endo + week:endo 
                Value Std.Error  DF   t-value p-value
(Intercept) 22.476263 0.7986137 307 28.144100  0.0000
week        -2.365687 0.3134843 307 -7.546430  0.0000
endo         1.988021 1.0747917  64  1.849680  0.0690
week:endo   -0.027056 0.4217255 307 -0.064155  0.9489
 Correlation: 
          (Intr) week   endo  
week      -0.451              
endo      -0.743  0.335       
week:endo  0.335 -0.743 -0.457

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.74478736 -0.49680979  0.03394173  0.49640589  3.62084094 

Number of Observations: 375
Number of Groups: 66 

The intercept term represents the average hdrs score when all other variables are equal to \(0\), so patients with exogenous depression at week \(0\). On average patients with exogenous depression start with an hdrs score of \(22.48\). The estimation of endo is the average difference in hdrs scores between endogenous and exogenous patients at week \(0\). Patients with endogenous depression start with average of \(24.46\).

The estimate of the random intercept per patient (\(SD = 3.41\)) indicates considerable fluctuation around the fixed intercepts, meaning patients can start quite a bit higher or lower than average hdrs score at week \(0\).

The average slope for week is \(-2.37\) for patients with exogenous depression. So, every week the hdrs score decreases on average by \(2.37\) for patients with exogenous depression.

The interaction between endo and week (\(-0.027\)) is the difference between the average slopes for week for the endogenous group and exogenous group. So, every week the hdrs score decreases on average by \(2.39\) for patients with endogenous depression.

The estimate of the random slope for week (\(SD = 1.44\)) indicates that the slope for time can be quite a bit steeper or flatter per patient. For some patients the slope for week may even be positive, meaning that the hdrs scores increase over time (as shown in the plot).


h) See if you can reduce the fixed part of this model.

Call this reduced fixed effect model fit_3.

The summary(fit_2) indicates that the interaction term can be dropped.

fit_3 <- lme(
  fixed     = hdrs ~ week + endo, 
  random    = ~ week | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "ML")

summary(fit_3)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long 
       AIC      BIC    logLik
  2228.933 2256.422 -1107.467

Random effects:
 Formula: ~week | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr  
(Intercept) 3.411963 (Intr)
week        1.441323 -0.285
Residual    3.495474       

Fixed effects:  hdrs ~ week + endo 
                Value Std.Error  DF    t-value p-value
(Intercept) 22.493440 0.7513921 308  29.935688  0.0000
week        -2.380637 0.2094264 308 -11.367420  0.0000
endo         1.956504 0.9546475  64   2.049451  0.0445
 Correlation: 
     (Intr) week  
week -0.320       
endo -0.704 -0.008

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.74690632 -0.49625552  0.03265459  0.49480574  3.62217196 

Number of Observations: 375
Number of Groups: 66 

All fixed effects in this model are significant, so it is not necessary to further simplify the model.


i) Re-fit the final model using restricted maximum likelihood (REML) by setting method = "REML" and interpret the model.

Call this model fit_4.

fit_4 <- lme(
  fixed     = hdrs ~ week + endo, 
  random    = ~ week | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "REML")

summary(fit_4)
Linear mixed-effects model fit by REML
  Data: reisby.long 
       AIC      BIC    logLik
  2228.116 2255.548 -1107.058

Random effects:
 Formula: ~week | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr  
(Intercept) 3.490342 (Intr)
week        1.457808 -0.287
Residual    3.494719       

Fixed effects:  hdrs ~ week + endo 
                Value Std.Error  DF    t-value p-value
(Intercept) 22.492881 0.7598098 308  29.603306  0.0000
week        -2.380472 0.2103154 308 -11.318581  0.0000
endo         1.956867 0.9658720  64   2.026011  0.0469
 Correlation: 
     (Intr) week  
week -0.318       
endo -0.704 -0.008

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.73520482 -0.49503123  0.03559898  0.49317021  3.62063687 

Number of Observations: 375
Number of Groups: 66 

The intercept term represents the average hdrs score when all other variables are equal to \(0\), so patients with exogenous depression at week \(0\). On average patients with exogenous depression start with an hdrs score of \(22.49\). The estimation of endo is the average difference in hdrs scores between endogenous and exogenous patients at week \(0\). Patients with endogenous depression start with average of \(24.45\).

The estimate of the random intercept per patient (\(SD = 3.49\)) indicates considerable fluctuation around the fixed intercepts, meaning patients can start quite a bit higher or lower than average hdrs score at week \(0\).

The average slope for week is \(-2.38\) for patients. So, every week the hdrs score decreases on average by \(2.38\) for patients.

The estimate of the random slope for week (\(SD = 1.46\)) indicates that the slope for time can be quite a bit steeper or flatter per individual. For some patients the slope for week may even be positive, meaning that the hdrs scores increase over time (as shown in the plot).


Exercise 2

In this exercise we repeat the analysis of the Reisby dataset in which time is centered.


a) Create a new variable week_centered that is equal to week - 2.5.

reisby.long <- reisby.long %>% mutate(week_centered = week - 2.5)

b) Fit a mixed model using the lme() function.

Call this model fit_5. In this model, estimate the depression score (hdrs) using the fixed effects for time centered (week_centered) and type of depression (endo). Further, add a random intercept plus a random slope for time per patient. Set the parameter na.action = "na.omit" and use REML estimation.

fit_5 <- lme(
  fixed     = hdrs ~ week_centered + endo, 
  random    = ~ week_centered | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "REML")

summary(fit_5)
Linear mixed-effects model fit by REML
  Data: reisby.long 
       AIC      BIC    logLik
  2228.116 2255.548 -1107.058

Random effects:
 Formula: ~week_centered | id
 Structure: General positive-definite, Log-Cholesky parametrization
              StdDev   Corr  
(Intercept)   4.261722 (Intr)
week_centered 1.457853 0.62  
Residual      3.494682       

Fixed effects:  hdrs ~ week_centered + endo 
                  Value Std.Error  DF   t-value p-value
(Intercept)   16.541702 0.7742180 308  21.36569  0.0000
week_centered -2.380472 0.2103196 308 -11.31836  0.0000
endo           1.956867 0.9658913  64   2.02597  0.0469
 Correlation: 
              (Intr) wk_cnt
week_centered  0.367       
endo          -0.697 -0.008

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.73516908 -0.49503779  0.03560095  0.49318682  3.62065135 

Number of Observations: 375
Number of Groups: 66 

c) Compare the results of this model fit_5 to fit_4 from the previous exercise.

Describe which estimates changed, and which did not. And explain why.

The estimate of the fixed-effect intercept changed. In both models, the intercept term represents the average hdrs score when all other variables are equal to \(0\). In fit_4 this represents the average hdrs score for patients with exogenous depression at week \(0\). In fit_5 the intercept term represents the average hdrs score for patients with exogenous depression at week \(2.5\).

Further, the estimate of the random intercept changed. In fit_4 the random intercept represents the fluctuation around the fixed intercepts at week \(0\). In fit_4 the random intercept represents the fluctuation around the fixed intercepts at week \(2.5\). Subsequently, The correlation between the values of the random intercept and random slopes for time also changed.

The plot below shows how both models provide still the same fitted values, however the meaning of \(\text{time} = 0\) is different. On the top figure \(\text{time} = 0\) when week = 0 and on the bottom figure \(\text{time} = 0\) when week = 2.5.

All other estimates remained equivalent between models.


Exercise 3

For this exercise, we will explore a potential quadratic effect of time in the Reisby example. Upon reviewing the spaghetti plots of the data, it seems possible that assuming a linear change in depression scores across time may be overly simplistic. It is possible that the trend in time is actually curvilinear.


a) Fit a mixed model using the lme() function that incorporates a fixed linear effect for time.

Call this model fit_6. In this model, estimate the depression score (hdrs) using a fixed linear effect for time. Use week_centered to model time defined in exercise 2. Further, add a random intercept plus a random slope for time per patient. Set the parameter na.action = "na.omit" and use ML estimation.

fit_6 <- lme(
  fixed     = hdrs ~ week_centered, 
  random    = ~ week_centered | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "ML")

summary(fit_6)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long 
       AIC      BIC    logLik
  2231.038 2254.599 -1109.519

Random effects:
 Formula: ~week_centered | id
 Structure: General positive-definite, Log-Cholesky parametrization
              StdDev   Corr  
(Intercept)   4.303369 (Intr)
week_centered 1.441873 0.609 
Residual      3.495206       

Fixed effects:  hdrs ~ week_centered 
                  Value Std.Error  DF   t-value p-value
(Intercept)   17.634279 0.5618142 308  31.38810       0
week_centered -2.377067 0.2092043 308 -11.36242       0
 Correlation: 
              (Intr)
week_centered 0.493 

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.74606041 -0.50161676  0.03322675  0.51774351  3.68336415 

Number of Observations: 375
Number of Groups: 66 

b) Fit a mixed model using the lme() function that includes both a fixed linear effect for time and a quadratic effect of time.

Call this model fit_7. In this model, estimate the depression score (hdrs) using the fixed linear effect for time and quadratic effect of time. Use week_centered to model time. Further, add a random intercept plus a random slope for time (only week_centered) per patient. Set the parameter na.action = "na.omit" and use ML estimation.

You can model the quadratic effect of time with I(week_centered^2).

When incorporating both a quadratic term and a linear term for time in a mixed model, it is advisable to center time first. Doing so can help alleviate collinearity between these fixed effects.

fit_7 <- lme(
  fixed     = hdrs ~ week_centered + I(week_centered^2), 
  random    = ~ week_centered | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "ML")

summary(fit_7)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long 
       AIC      BIC    logLik
  2232.626 2260.115 -1109.313

Random effects:
 Formula: ~week_centered | id
 Structure: General positive-definite, Log-Cholesky parametrization
              StdDev   Corr  
(Intercept)   4.303325 (Intr)
week_centered 1.443029 0.61  
Residual      3.492582       

Fixed effects:  hdrs ~ week_centered + I(week_centered^2) 
                       Value Std.Error  DF   t-value p-value
(Intercept)        17.502333 0.5992078 307  29.20912  0.0000
week_centered      -2.374736 0.2095957 307 -11.33008  0.0000
I(week_centered^2)  0.047761 0.0747180 307   0.63921  0.5232
 Correlation: 
                   (Intr) wk_cnt
week_centered       0.458       
I(week_centered^2) -0.345  0.017

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.79360199 -0.49714363  0.03869342  0.49953351  3.69487074 

Number of Observations: 375
Number of Groups: 66 

c) Perform a likelihood-ratio test to determine whether adding a fixed quadratic time term improves the model.

anova(fit_6, fit_7)
      Model df      AIC      BIC    logLik   Test   L.Ratio p-value
fit_6     1  6 2231.037 2254.599 -1109.519                         
fit_7     2  7 2232.626 2260.115 -1109.313 1 vs 2 0.4114832  0.5212

Based on the results of the likelihood ratio test, we would conclude that there is no significant improvement in model fit when adding a fixed quadratic effect for time in fit_7, compared to the model fit_6 with only a linear effect.


d) Fit a mixed model using the lme() function.

Although the previous analysis suggested that a model with a fixed quadratic effect for time was not significantly better than a model with only a linear effect. Create a new model where you add a quadratic random effect for time.

Create a new mixed model named fit_8 that estimates the depression score (hdrs) using a fixed linear effect for time and a quadratic effect of time. To model time, use the week_centered variable. Additionally, include random intercepts and random slopes for both time and time squared per patient. Specify na.action = "na.omit" as a parameter and use ML estimation.

fit_8 <- lme(
  fixed     = hdrs ~ week_centered + I(week_centered^2), 
  random    = ~ week_centered + I(week_centered^2) | id, 
  data      = reisby.long, 
  na.action = "na.omit", 
  method    = "ML")

summary(fit_8)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long 
       AIC      BIC    logLik
  2227.648 2266.917 -1103.824

Random effects:
 Formula: ~week_centered + I(week_centered^2) | id
 Structure: General positive-definite, Log-Cholesky parametrization
                   StdDev    Corr         
(Intercept)        4.9246191 (Intr) wk_cnt
week_centered      1.4548925  0.504       
I(week_centered^2) 0.4401575 -0.573  0.050
Residual           3.2428349              

Fixed effects:  hdrs ~ week_centered + I(week_centered^2) 
                       Value Std.Error  DF    t-value p-value
(Intercept)        17.500568 0.6605456 307  26.494111  0.0000
week_centered      -2.375170 0.2074442 307 -11.449681  0.0000
I(week_centered^2)  0.051481 0.0887028 307   0.580379  0.5621
 Correlation: 
                   (Intr) wk_cnt
week_centered       0.400       
I(week_centered^2) -0.553  0.046

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.62118644 -0.52641734  0.01065469  0.47660417  3.83483483 

Number of Observations: 375
Number of Groups: 66 

e) Perform a likelihood ratio test to determine whether adding a quadratic random term for time improves the model.

anova(fit_6, fit_7, fit_8)
      Model df      AIC      BIC    logLik   Test   L.Ratio p-value
fit_6     1  6 2231.037 2254.599 -1109.519                         
fit_7     2  7 2232.626 2260.115 -1109.313 1 vs 2  0.411483  0.5212
fit_8     3 10 2227.648 2266.917 -1103.824 2 vs 3 10.978143  0.0118

Based on either the AIC or LRT, we can conclude that fit_7, which includes a fixed quadratic effect for time, does not fit significantly better than fit_6, which only includes a linear effect.

However, adding a quadratic random term for time improves the model fit significantly.


f) The results of the previous model comparison suggest that although the average depression score trend across time is essentially linear, it is curvilinear at the individual level. Can you explain this phenomenon?

There are likely similar numbers of individuals with convex and concave curves (see colored lines). However, on average, we have a linear time trend (black line).


Exercise 4

For this assignment, you will be conducting an analysis on the Reisby dataset while using baseline as a covariate.

To begin, you will have to convert the provided wide dataset into a long format. However, do not pivot hdrs.0 into a longer format; rather, include a duplicated column for the hdrs.0 (baseline) measurement.

Below are the first few rows of what the resulting dataset should look like.

# A tibble: 7 × 5
     id  endo  time  hdrs hdrs_baseline
  <dbl> <dbl> <dbl> <dbl>         <dbl>
1   101     0     0    22            26
2   101     0     1    18            26
3   101     0     2     7            26
4   101     0     3     4            26
5   101     0     4     3            26
6   103     0     0    24            33
7   103     0     1    15            33

It is important to note that in this format, \(time=0\) corresponds to the first week of the study.


a) Create a dataset in long format, including a column for the baseline (hdrs.0) measure.

Call this reisby.long.baseline.

You can apply the pivot_longer() function on the reisby.wide dataset. Set the parameter names_pattern = "hdrs.(.)".

reisby.long.baseline <- reisby.wide %>%
  pivot_longer(
    -c(id, endo, hdrs.0), 
    names_to = "time",
    names_pattern = "hdrs.(.)",
    values_to = "hdrs") %>%
  mutate(
    time = as.numeric(time) - 1,
    hdrs_baseline = hdrs.0, .keep = "unused") 

b) Fit a mixed model using the lme() function.

Create a new mixed model named fit_9 that estimates the depression score (hdrs) using a fixed linear effect for time, type of depression (endo), and baseline hdrs scores. Additionally, include random intercepts and random slopes for time per patient. Specify na.action = "na.omit" as a parameter and use ML estimation.

fit_9 <- lme(
  fixed     = hdrs ~ time + endo + hdrs_baseline, 
  random    = ~ time | id, 
  data      = reisby.long.baseline, 
  na.action = "na.omit", 
  method    = "ML")

summary(fit_9)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long.baseline 
      AIC      BIC    logLik
  1729.06 1758.391 -856.5298

Random effects:
 Formula: ~time | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr  
(Intercept) 2.855018 (Intr)
time        1.639083 0.081 
Residual    3.476704       

Fixed effects:  hdrs ~ time + endo + hdrs_baseline 
                  Value Std.Error  DF   t-value p-value
(Intercept)    8.951304 2.6552010 227  3.371234  0.0009
time          -2.435033 0.2599968 227 -9.365628  0.0000
endo           1.864714 0.9972006  58  1.869948  0.0665
hdrs_baseline  0.491875 0.1121986  58  4.383964  0.0000
 Correlation: 
              (Intr) time   endo  
time          -0.045              
endo          -0.067 -0.011       
hdrs_baseline -0.961 -0.008 -0.135

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.41006631 -0.47728134  0.05821129  0.48495836  3.14324625 

Number of Observations: 289
Number of Groups: 61 

c) Interpret, carefully, the results.

summary(fit_9)
Linear mixed-effects model fit by maximum likelihood
  Data: reisby.long.baseline 
      AIC      BIC    logLik
  1729.06 1758.391 -856.5298

Random effects:
 Formula: ~time | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr  
(Intercept) 2.855018 (Intr)
time        1.639083 0.081 
Residual    3.476704       

Fixed effects:  hdrs ~ time + endo + hdrs_baseline 
                  Value Std.Error  DF   t-value p-value
(Intercept)    8.951304 2.6552010 227  3.371234  0.0009
time          -2.435033 0.2599968 227 -9.365628  0.0000
endo           1.864714 0.9972006  58  1.869948  0.0665
hdrs_baseline  0.491875 0.1121986  58  4.383964  0.0000
 Correlation: 
              (Intr) time   endo  
time          -0.045              
endo          -0.067 -0.011       
hdrs_baseline -0.961 -0.008 -0.135

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.41006631 -0.47728134  0.05821129  0.48495836  3.14324625 

Number of Observations: 289
Number of Groups: 61 

The intercept term represents the average hdrs score for patients with exogenous depression and a baseline hdrs score of \(0\) at week \(1\).

On average at week \(1\), patients with exogenous depression and a hdrs baseline score of \(0\) start with a hdrs score of \(8.95\). Patients with endogenous depression and a hdrs baseline of \(0\) start on average at week \(1\) with a hdrs score of \(10.82\).

The result show that a \(1\) unit increase in baseline hdrs score corresponds with an increase of \(0.49\) units in hdrs scores. This indicates that patients with higher baseline hdrs score tend to have higher hdrs scores.

The estimate of the random intercept per patient (\(SD = 2.86\)) indicates considerable fluctuation around the fixed intercepts, meaning patients can start quite a bit higher or lower than average baseline adjusted hdrs score at week \(1\).

The average slope for time is \(-2.44\) for patients. So, every week the hdrs score decreases on average by \(2.44\).

The estimate of the random slope for time (\(SD = 1.64\)) indicates that the slope for time can be quite a bit steeper or flatter per individual.


Exercise 5

The stroke.Rdata file contains data from an experiment designed to enhance the recovery of stroke patients. The study involved three experimental groups:

  1. Group A received a novel occupational therapy intervention.
  2. Group B received the pre-existing stroke rehabilitation program at the same hospital where Group A was treated.
  3. Group C underwent the standard care regimen for stroke patients, which was administered at a different hospital.

Each group had 8 patients, and the response variable was a measure of functional ability, known as the Barthel index. A higher score on this index indicates better outcomes, with a maximum score of 100. Each program lasted for 8 weeks, and all participants were assessed at the beginning of the program and weekly until the end. The hypothesis was that patients in group A would fare better than those in groups B or C.


a) Formulate two possible hypotheses regarding the effect of treatment?


b) Load in the Stroke.Rdata.


c) Convert the data into a long format.

Call this new dataset stroke_long. Pivot all week columns into a longer format.

You can apply the pivot_longer() function on the stroke dataset. Set the parameter names_pattern = "week(.)".


d) Fit a mixed model using the lme() function.

Call this model fit_10. We want to answer two questions with this model. The first question is whether patients in group A have higher scores compared to the other groups. The second question is whether patients in group A recover more quickly than patients in other groups. Include random intercepts and random slopes for time per patient. Specify na.action = "na.omit" as a parameter and use REML estimation.


e) Interpret the model and answer the aforementioned research questions.